Abstract

Multithreshold Entropy Linear Classifier (MELC) is a recent classifier
idea which employs information theoretic concepts in order to create a
multithreshold maximum margin model. In this paper we analyze its
consistency over multithreshold linear models and show that its
objective function upper bounds the number of misclassified points in
a similar manner to the hinge loss in support vector machines. For
further confirmation we also conduct some numerical experiments on
five datasets.


1 Introduction 

Many of the existing machine learning classifiers are based on the
minimization of some additive loss function which penalizes each
misclassification [6]. This class of models includes the perceptron,
neural networks, logistic regression, linear regression, support vector
machines (both traditional and least squares) and many others. For most
of such approaches it is possible to prove their consistency, meaning
that under the assumption that our data is sampled i.i.d. from some
unknown probability distribution, the algorithm will converge to the
optimal model in the Bayesian sense as the sample size grows to
infinity [8, 9]. While it is quite natural to be consistent with a loss
function which is being directly minimized, such a loss generally only
upper bounds the number of wrong answers.

In general, up to some weighting schemes, the classic measure of
the classification error is the expected number of misclassified samples
from some unknown distribution $\mathcal{F}$:

$$E\left[ y_i \neq cl(x_i) \mid (x_i, y_i) \sim \mathcal{F} \right],$$

which directly translates to

$$E\left[ l(cl(x_i), y_i, x_i) \mid (x_i, y_i) \sim \mathcal{F} \right],$$

for $l(p, y, x) = 1 - \mathbb{1}_{py > 0}$. We call $l$ the 0/1 loss function
and use the $l_{0/1}$ notation. As a result we can define an empirical risk
over the training set as

$$\mathcal{R}_{emp}(\{(x_i, y_i)\}_{i=1}^N) = \frac{1}{N} \sum_{i=1}^N l(cl(x_i), y_i, x_i),$$

which can be minimized over some family of classifiers $cl$.
Unfortunately, for the 0/1 loss the resulting optimization problem is
hard even for linear models. To overcome this issue many classifiers are
constructed through optimization of some similar loss function which
results in feasible problems. For example, support vector machines
replace the 0/1 loss with the so-called hinge loss

$$l_H(p, y, x) = \max\{0, 1 - py\},$$

for $y \in \{-1, +1\}$. It turns out that this problem is convex in the
class of linear classifiers, and so easy to solve. There are two
important aspects of the hinge loss that make it a reasonable surrogate
function. First, $l_H(p, y, x) = 0 \Rightarrow l_{0/1}(p, y, x) = 0$ ¹;
second, $l_H(p, y, x) \geq l_{0/1}(p, y, x)$. In other words, it is an
upper bound of the 0/1 loss, and when it attains zero there are no
misclassified points.
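The two properties of the hinge loss can be checked numerically. The snippet below is a minimal sketch, assuming that the 0/1 loss counts every prediction with $py \leq 0$ as an error; the prediction/label pairs are illustrative values, not data from the paper.

```python
def zero_one_loss(p, y):
    """0/1 loss: 1 unless the prediction p agrees in sign with the label y."""
    return 0.0 if p * y > 0 else 1.0


def hinge_loss(p, y):
    """Hinge loss max{0, 1 - p*y} for a label y in {-1, +1}."""
    return max(0.0, 1.0 - p * y)


# Hypothetical real-valued predictions paired with labels (illustrative only).
samples = [(2.0, +1), (0.3, +1), (-0.5, +1), (-1.7, -1), (0.2, -1), (-1.0, -1)]

for p, y in samples:
    # Second property: the hinge loss dominates the 0/1 loss pointwise.
    assert hinge_loss(p, y) >= zero_one_loss(p, y)
    # First property: a zero hinge loss implies a correct classification.
    if hinge_loss(p, y) == 0.0:
        assert zero_one_loss(p, y) == 0.0

hinge_total = sum(hinge_loss(p, y) for p, y in samples)
errors = sum(zero_one_loss(p, y) for p, y in samples)
```

Note that the sample (0.3, +1) is classified correctly yet still pays a hinge penalty, which is why the implication in the first property holds in one direction only.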

In this paper we analyze the Multithreshold Entropy Linear Classifier,
a recently proposed [1] classifier which builds a multithreshold linear
model using information theoretic concepts. It is a density based
approach which cannot be easily translated to the language of additive
loss functions. We show that this model is consistent with the 0/1 loss
over simple families of distributions and that in general it also upper
bounds the 0/1 loss in the class of multithreshold linear classifiers,
and when it attains zero there are no misclassified points. We also
draw some intuitions to show how this model is related to other linear
classifiers and conclude with some numerical experiments.

¹ The implication is an equivalence up to scaling of the linear operator,
as the hinge loss returns non-zero values for predictions in the
$(-1, 1)$ interval.


2 Multithreshold Entropy Linear Classifier

Multithreshold Entropy Linear Classifier (MELC [1]) aims at finding
a linear operator $v$ that maximizes the Cauchy-Schwarz Divergence [3]
between the kernel density estimates of each class projected onto $v$.
Due to the affine transformation invariance of this problem one can
(and should, as shown in [1]) restrict to the unit sphere, meaning that
$\|v\| = 1$.

There are many density based methods; in particular, one can perform
kernel density estimation for each class of any dataset and simply
classify according to which density is bigger. However, such an approach
cannot work in general due to the curse of dimensionality and the fact
that density estimation requires an enormous number of points for
reasonable results (the number of required points grows exponentially
with the data dimension). As a result, existing datasets can be used to
approximate density in at most a few dimensions, while data can have
thousands. This leads to the very natural concept of performing density
estimation of a low dimensional data projection, in particular a one
dimensional one, as performed by MELC.

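For given sets of points $X_-, X_+$, their projections onto $v$ are
simply $v^T X_-, v^T X_+$. The kernel density estimation using
Silverman's rule [7] is given by

$$[\![v^T X_\pm]\!](x) := \frac{1}{|X_\pm|} \sum_{x_\pm \in X_\pm} \frac{1}{\sqrt{2\pi}\,\sigma_\pm} \exp\left(-\frac{\|v^T x_\pm - x\|^2}{2\sigma_\pm^2}\right),$$

where

$$\sigma_\pm = \left(\frac{4}{3|X_\pm|}\right)^{1/5} \mathrm{std}(v^T X_\pm).$$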
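As an illustration, the projected kernel density estimate can be sketched in a few lines. This is a minimal sketch, assuming a Gaussian kernel, the population standard deviation in Silverman's rule, and synthetic 2d points; the helper names `silverman_bandwidth`, `project` and `kde` are hypothetical and not taken from [1].

```python
import math
import random


def silverman_bandwidth(values):
    """Silverman's rule: sigma = (4 / (3 n))^(1/5) * std(values)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (division by n).
    std = math.sqrt(sum((t - mean) ** 2 for t in values) / n)
    return (4.0 / (3.0 * n)) ** 0.2 * std


def project(points, v):
    """Project every point onto the direction v (inner products <v, x>)."""
    return [sum(vi * xi for vi, xi in zip(v, x)) for x in points]


def kde(projected, sigma):
    """Return the Gaussian kernel density estimate as a function of a scalar t."""
    norm = 1.0 / (len(projected) * math.sqrt(2.0 * math.pi) * sigma)

    def density(t):
        return norm * sum(math.exp(-(t - s) ** 2 / (2.0 * sigma ** 2)) for s in projected)

    return density


# Illustrative data: 50 points of one class in 2d, projected onto a unit vector.
rng = random.Random(0)
x_pos = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(50)]
v = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
projected = project(x_pos, v)
sigma = silverman_bandwidth(projected)
density = kde(projected, sigma)
```

The returned `density` is a proper one-dimensional pdf, so it integrates to one regardless of the dimension of the original data.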

Now to define the MELC objective function, we need some definitions,
namely:

• cross information potential which, as shown in [1], is connected
to minimization of the empirical risk

$$ip^\times(f_-, f_+) = \int f_-(x) f_+(x)\,dx.$$

• Renyi's quadratic cross entropy, as defined in [5], is simply the
negative logarithm of $ip^\times$

$$H_2^\times(f_-, f_+) = -\ln(ip^\times(f_-, f_+)).$$

• Renyi's quadratic entropy is Renyi's quadratic cross entropy
between a pdf and itself

$$H_2(f) = H_2^\times(f, f).$$

• Cauchy-Schwarz Divergence, optimized by the full MELC model

$$D_{CS}(f_-, f_+) = 2H_2^\times(f_-, f_+) - H_2(f_-) - H_2(f_+).$$
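For Gaussian kernel density estimates all four quantities have a closed form, because the integral of the product of two one-dimensional Gaussians with means $s, t$ and variances $\sigma_a^2, \sigma_b^2$ equals a Gaussian density with variance $\sigma_a^2 + \sigma_b^2$ evaluated at $s - t$. The sketch below relies on that identity; the function names and the sample values are illustrative assumptions.

```python
import math


def cross_information_potential(a, sigma_a, b, sigma_b):
    """ip^x of two Gaussian KDEs with centres a, b and bandwidths sigma_a, sigma_b."""
    var = sigma_a ** 2 + sigma_b ** 2
    norm = 1.0 / (len(a) * len(b) * math.sqrt(2.0 * math.pi * var))
    return norm * sum(math.exp(-(s - t) ** 2 / (2.0 * var)) for s in a for t in b)


def cross_entropy(a, sigma_a, b, sigma_b):
    """Renyi's quadratic cross entropy: -ln(ip^x)."""
    return -math.log(cross_information_potential(a, sigma_a, b, sigma_b))


def entropy(a, sigma_a):
    """Renyi's quadratic entropy: cross entropy of a density with itself."""
    return cross_entropy(a, sigma_a, a, sigma_a)


def cauchy_schwarz_divergence(a, sigma_a, b, sigma_b):
    """D_CS = 2 * H_2^x(f_-, f_+) - H_2(f_-) - H_2(f_+)."""
    return 2.0 * cross_entropy(a, sigma_a, b, sigma_b) - entropy(a, sigma_a) - entropy(b, sigma_b)


# Illustrative one-dimensional projections of two classes.
a = [0.0, 0.2, 0.4]
b = [2.0, 2.3]
d_self = cauchy_schwarz_divergence(a, 0.3, a, 0.3)
d_near = cauchy_schwarz_divergence(a, 0.3, b, 0.3)
d_far = cauchy_schwarz_divergence(a, 0.3, [t + 3.0 for t in b], 0.3)
```

The divergence vanishes when both densities coincide and grows as the projected classes move apart, which is the behaviour the full MELC objective exploits.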

In particular, non-regularized MELC is prone to overfitting, which can
be summarized by the following observation.

Observation 1. Given an arbitrary finite, consistent set of samples
$\{(x_i, y_i)\}_{i=1}^N \subset \mathbb{R}^d \times \{-1, +1\}$,
non-regularized MELC learns it with zero error for sufficiently small
$\sigma$.

Proof. First let us notice that any finite, consistent sample set is
separable by some multithreshold linear classifier. In other words

$$\forall_{\{(x_i, y_i)\}_{i=1}^N} \; \exists_v \; \forall_{i \neq j} \; \langle v, x_i \rangle \neq \langle v, x_j \rangle.$$

Obviously, there are at most $N^2$ pairs of vectors which can violate
this assumption, each defining a family of linear projections that
project them onto the same point,
$v_{ij} = \{v : \langle v, x_i \rangle = \langle v, x_j \rangle\} = \{v : \langle v, x_i - x_j \rangle = 0\}$,
thus each $v_{ij}$ is a hyperplane in $\mathbb{R}^d$.

So it is sufficient to choose $v \in \mathbb{R}^d \setminus \bigcup_{i,j} v_{ij}$,
which is a non-empty set, as for any $d > 1$ there are infinitely many
possible angles that vectors can form with each axis, and for $d = 1$
all $v_{ij} = \{0\}$ (from the dataset consistency).

In the worst case it results in an $(N-1)$-multithreshold linear
classifier. As a consequence, there exists a linear projection for
which the smallest margin between samples of this set is greater than
zero.

As has been shown in [1], non-regularized MELC maximizes the smallest
margin among all margins in multithreshold linear classifiers as
$\sigma$ approaches 0. At the same time, MELC will fail to learn these
samples perfectly if and only if at least two samples are projected
onto the very same point, which is equivalent to the maximum of the
smallest margin in the class of multithreshold linear classifiers for
this sample being equal to 0, a contradiction. □

In particular, this means that for small values of $\sigma$, without
regularization, this model has infinite Vapnik-Chervonenkis dimension,
as do many other density or nearest neighbours based approaches. In the
following section we focus on a more practical characteristic: whether
this classifier is able to learn an arbitrary continuous distribution
with the smallest error obtainable in its class of models. This
characteristic is called consistency and can be defined as follows.

Definition 1 (Consistency). Model $M$ is called consistent with error
measure $E$ and family of distributions $\mathcal{F}$ in the class of
models $\mathcal{M}$ if for any $f \in \mathcal{F}$, $M$ trained on
i.i.d. samples from $f$ approaches the minimum error, as measured by
$E$, over all models in $\mathcal{M}$ on $f$ as the sample size goes to
infinity.

3 Non-regularized MELC consistency 

In this section we focus on non-regularized MELC, which searches for
a linear projection $v$ (with norm 1) maximizing Renyi's quadratic cross
entropy of the kernel density estimates of the projected data:

$$v^* = \arg\max_v H_2^\times([\![v^T X_-]\!], [\![v^T X_+]\!]),$$

and which makes a classification decision based on the estimated
projected densities

$$cl(x) = \mathrm{sign}\left([\![v^{*T} X_+]\!](v^{*T} x) - [\![v^{*T} X_-]\!](v^{*T} x)\right).$$

We show that such a classifier is nearly consistent with the 0/1 loss in
the class of all multithreshold linear classifiers. We also draw an
analogy between its approach and the one taken by the support vector
machine model (as well as other models based on regularized empirical
risk minimization). Let us start with some basic definitions and
notation.
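The training and decision rules of non-regularized MELC can be sketched for two-dimensional data as follows. This is only an illustration under stated assumptions: the exhaustive scan over angles stands in for whatever optimizer is actually used in [1], the data are synthetic, and all function names are hypothetical. Maximizing $H_2^\times$ is implemented as minimizing the cross information potential, which is equivalent and avoids taking logarithms of tiny numbers.

```python
import math
import random


def silverman(values):
    """Silverman's rule with the population standard deviation (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in values) / n)
    return (4.0 / (3.0 * n)) ** 0.2 * std


def project(points, v):
    return [v[0] * x[0] + v[1] * x[1] for x in points]


def cross_ip(a, sa, b, sb):
    """Closed-form integral of the product of two Gaussian KDEs."""
    var = sa ** 2 + sb ** 2
    total = sum(math.exp(-(s - t) ** 2 / (2.0 * var)) for s in a for t in b)
    return total / (len(a) * len(b) * math.sqrt(2.0 * math.pi * var))


def kde(values, sigma, t):
    total = sum(math.exp(-(t - s) ** 2 / (2.0 * sigma ** 2)) for s in values)
    return total / (len(values) * math.sqrt(2.0 * math.pi) * sigma)


def fit(x_neg, x_pos, n_angles=90):
    """Hypothetical optimizer: scan unit vectors and keep the smallest ip^x."""
    best = None
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        v = (math.cos(theta), math.sin(theta))
        a, b = project(x_neg, v), project(x_pos, v)
        ip = cross_ip(a, silverman(a), b, silverman(b))
        if best is None or ip < best[0]:
            best = (ip, v)
    return best[1]


def classify(x, v, x_neg, x_pos):
    """Sign of the difference of the projected class densities at <v, x>."""
    a, b = project(x_neg, v), project(x_pos, v)
    t = v[0] * x[0] + v[1] * x[1]
    diff = kde(b, silverman(b), t) - kde(a, silverman(a), t)
    return 1 if diff >= 0 else -1


# Synthetic, well separated classes (illustrative only).
rng = random.Random(0)
x_neg = [(rng.gauss(-3.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40)]
x_pos = [(rng.gauss(3.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40)]
v_star = fit(x_neg, x_pos)
correct = sum(classify(x, v_star, x_neg, x_pos) == -1 for x in x_neg)
correct += sum(classify(x, v_star, x_neg, x_pos) == 1 for x in x_pos)
accuracy = correct / (len(x_neg) + len(x_pos))
```

An exhaustive scan is feasible only because the unit sphere in 2d is a circle; in higher dimensions a gradient-based search over the sphere would be needed instead.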
Definition 2 (Expected accuracy). Given some classifier
$cl : X \to \{-1, +1\}$, the expected accuracy over distributions
$f_-, f_+$ with priors $p(-), p(+)$ is

$$p(-) \int \max\{0, -cl(x)\} f_-(x)\,dx + p(+) \int \max\{0, cl(x)\} f_+(x)\,dx.$$

For unbalanced datasets we might be more interested in measures that
make both classes equally important despite their sizes (priors), which
leads to the averaged accuracy (also known as balanced/weighted
accuracy).

Definition 3 (Expected averaged accuracy). Given some classifier
$cl : X \to \{-1, +1\}$, the expected averaged accuracy (ignoring the
classes' priors) over distributions $f_-, f_+$ is

$$\frac{1}{2} \int \max\{0, -cl(x)\} f_-(x)\,dx + \frac{1}{2} \int \max\{0, cl(x)\} f_+(x)\,dx.$$
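The difference between the two definitions is easiest to see on a small unbalanced sample. The sketch below replaces the integrals with class-wise sample means, which is an assumption made for illustration; the one-dimensional data and the threshold classifier are hypothetical.

```python
def expected_accuracy(cl, negatives, positives, p_neg, p_pos):
    """Sample-based counterpart of Definition 2 (integrals become class-wise means)."""
    acc_neg = sum(max(0, -cl(x)) for x in negatives) / len(negatives)
    acc_pos = sum(max(0, cl(x)) for x in positives) / len(positives)
    return p_neg * acc_neg + p_pos * acc_pos


def expected_averaged_accuracy(cl, negatives, positives):
    """Sample-based counterpart of Definition 3: both classes weighted by 1/2."""
    return expected_accuracy(cl, negatives, positives, 0.5, 0.5)


def threshold_classifier(x):
    """Single-threshold classifier on the real line (illustrative)."""
    return 1 if x > 0 else -1


# Unbalanced toy sample: 8 negative points (one misclassified) and
# 2 positive points (one misclassified).
negatives = [-2.0, -1.5, -1.0, -0.5, 0.5, -3.0, -0.2, -0.7]
positives = [1.0, -0.4]
p_neg = len(negatives) / (len(negatives) + len(positives))
p_pos = 1.0 - p_neg

accuracy = expected_accuracy(threshold_classifier, negatives, positives, p_neg, p_pos)
balanced = expected_averaged_accuracy(threshold_classifier, negatives, positives)
```

With priors 0.8 and 0.2 the plain accuracy is dominated by the larger class, while the averaged accuracy penalizes the poor performance on the small positive class more strongly.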

Let us now compute the smallest error obtainable by multithreshold
linear classifiers, as measured by expected averaged accuracy (EAA).

Proposition 1 (Multithreshold Linear Classifier EAA Bayes Risk).
For the family of multithreshold linear classifiers, the smallest
obtainable EAA error for distributions $f_-, f_+$ equals

$$\mathcal{R}_{EAA}(f_-, f_+) = \min_v \int \min\{(v^T f_-)(x), (v^T f_+)(x)\}\,dx.$$

Proof. $\int \min\{(v^T f_-)(x), (v^T f_+)(x)\}\,dx$ simply expresses the
probability of making a bad classification over the whole data
projection. Each point $v^T x$ has to be classified as a member of
either $f_-$ or $f_+$ and obviously, when choosing the class with the
larger projected density, we make an error at any point $x$ with
probability $\min\{(v^T f_-)(x), (v^T f_+)(x)\}$. As a result, the
projection which realizes the minimum of the probability of an error is
the one giving the greatest expected averaged accuracy. □
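For a fixed projection the integral of the pointwise minimum is easy to evaluate numerically. The sketch below assumes, purely for illustration, that the two projected densities are the one-dimensional normals $\mathcal{N}(0, 1)$ and $\mathcal{N}(2, 1)$, for which the overlap has the closed form $2\Phi(-1)$ because the densities cross at the midpoint between the means.

```python
import math


def normal_pdf(x, mean, std):
    return math.exp(-(x - mean) ** 2 / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))


def overlap(pdf_neg, pdf_pos, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of min{pdf_neg, pdf_pos} on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += min(pdf_neg(x), pdf_pos(x))
    return h * total


# Assumed projected densities: N(0, 1) and N(2, 1).
risk = overlap(lambda x: normal_pdf(x, 0.0, 1.0),
               lambda x: normal_pdf(x, 2.0, 1.0),
               -10.0, 12.0)

# Closed form 2 * Phi(-1), written with the error function.
closed_form = 1.0 + math.erf(-1.0 / math.sqrt(2.0))
```

The same `overlap` helper can be applied to any pair of projected densities, for example kernel density estimates, when no closed form is available.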

In the following sections we assume that the kernel density estimate
approximating the data distribution is the actual distribution, as
with the sample size growing to infinity kernel density estimation with
Silverman's rule [7] is guaranteed to converge to the true distribution.
As a consequence, each result regarding a property over a distribution
is also true over a finite sample in the limiting case. We also use the
notation

$$\mathcal{R}_{EAA}(v; f_-, f_+) = \int \min\{(v^T f_-)(x), (v^T f_+)(x)\}\,dx,$$

for the smallest obtainable multithreshold linear classifier
misclassification error for a given projection $v$. So in particular

$$v_{opt} = \arg\min_v \mathcal{R}_{EAA}(v; f_-, f_+),$$

$$\mathcal{R}_{EAA}(f_-, f_+) = \min_v \mathcal{R}_{EAA}(v; f_-, f_+) = \mathcal{R}_{EAA}(v_{opt}; f_-, f_+).$$

Let us begin with the simplest case, when there exists a perfect
classifier able to distinguish the samples' classes (the case when the
Bayesian risk is 0).

Observation 2. Non-regularized MELC is consistent with the 0/1 loss
on multithreshold linearly separable distributions.

Proof. If two distributions are perfectly separable by a multithreshold
linear separator then there exists a linear projection $v_{opt}$ such
that the common support of the distributions projected onto $v_{opt}$
has zero measure,

$$|\mathrm{supp}(v_{opt}^T f_-) \cap \mathrm{supp}(v_{opt}^T f_+)| = 0.$$

Obviously, $ip^\times(v_{opt}^T f_-, v_{opt}^T f_+) = 0$, as we integrate
a function which is non-zero only on a set of zero measure.

Similarly, $\forall v : ip^\times(v^T f_-, v^T f_+) = 0 \Rightarrow |\mathrm{supp}(v^T f_-) \cap \mathrm{supp}(v^T f_+)| = 0$,
because if the integral of the product of two non-negative functions is
equal to zero then both of these functions can be non-zero only on a
set of zero measure. As a result the solution given by non-regularized
MELC attains the Bayesian risk for this class of distributions. □

Let us now investigate the situation when the data of each class come
from a radial normal distribution.

Observation 3. Non-regularized MELC is consistent with the 0/1 loss
on radial normal distributions.

Proof. Let us assume that we are given Gaussians with variances
$\sigma_-^2$ and $\sigma_+^2$ respectively,

$$f_- = \mathcal{N}(m_-, \sigma_-^2 I), \quad f_+ = \mathcal{N}(m_+, \sigma_+^2 I).$$

It is easy to see that linear projections of these distributions form a
family of one-dimensional normal distributions with variances
$\sigma_-^2, \sigma_+^2$ respectively and distance between their means
$\|v^T m_- - v^T m_+\|$ in the $[0, \|m_- - m_+\|]$ interval. The optimal
projection is given by $v_{opt}$ which maximizes the distance between
these means, so $v_{opt} = \pm(m_- - m_+)$ (up to normalization).

On the other hand, according to Czarnecki et al. [1], we have

$$ip^\times(v^T f_-, v^T f_+) = \frac{1}{\sqrt{2\pi(\sigma_-^2 + \sigma_+^2)}} \exp\left(-\frac{\|v^T m_- - v^T m_+\|^2}{2(\sigma_-^2 + \sigma_+^2)}\right),$$

so obviously $ip^\times$ is minimized (and $H_2^\times$ maximized) when
$\|v^T m_- - v^T m_+\|^2$ is maximized. As a result non-regularized MELC
selects the optimal linear projection. □
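This observation can be illustrated numerically in two dimensions. The sketch below evaluates the closed-form cross information potential of two projected radial Gaussians (the product integral of two one-dimensional normals) on a grid of unit vectors and compares the minimizer with the direction of the difference of the means. The means, variances and grid resolution are illustrative assumptions.

```python
import math


def projected_cross_ip(v, m_neg, m_pos, var_neg, var_pos):
    """ip^x of two radial Gaussians after projection onto the unit vector v."""
    gap = sum(vi * (a - b) for vi, a, b in zip(v, m_neg, m_pos))
    var = var_neg + var_pos
    return math.exp(-gap ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Illustrative means and variances of the two classes.
m_neg, m_pos = (0.0, 0.0), (3.0, 4.0)
var_neg, var_pos = 1.0, 4.0

# Scan unit vectors v = (cos t, sin t) for t in [0, pi); v and -v are equivalent.
n_angles = 1800
step = math.pi / n_angles
angles = [k * step for k in range(n_angles)]
best_theta = min(
    angles,
    key=lambda t: projected_cross_ip((math.cos(t), math.sin(t)), m_neg, m_pos, var_neg, var_pos),
)

# Angle of the line joining the two means.
target_theta = math.atan2(m_pos[1] - m_neg[1], m_pos[0] - m_neg[0])
```

Up to the grid resolution, the angle minimizing the cross information potential coincides with the angle of $m_+ - m_-$, in line with the proof.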


Unfortunately, MELC (whether regularized or not) does not seem to be
consistent with the 0/1 loss in general. However, we show that the 0/1
loss is nicely bounded by its objective function, which draws an
analogy between this approach and those taken by other linear models.

We start with a simple lemma connecting the square of a function's
integral and the integral of the function's square on a bounded interval.


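Lemma 1. For any square integrable function $f$ such that
$\forall x : f(x) \geq 0$

$$\left(\int_0^1 f(x)\,dx\right)^2 \leq \int_0^1 f^2(x)\,dx.$$

Proof. This is an obvious consequence of the Schwarz inequality

$$\left(\int_a^b f(x) g(x)\,dx\right)^2 \leq \int_a^b f^2(x)\,dx \int_a^b g^2(x)\,dx,$$

for $a = 0$, $b = 1$, $f$ being non-negative and $g$ being a constant
function equal to $c > 0$,

$$c^2 \left(\int_0^1 f(x)\,dx\right)^2 \leq c^2 \int_0^1 f^2(x)\,dx.$$

□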


Now we can prove the main theorem of this paper. 
Theorem 1. Negative log likelihood of the minimal obtainable
misclassification error of a given multithreshold linear classifier, for
any distributions that are not multithreshold linearly separable, is at
least half of Renyi's quadratic cross entropy of the data projections
used by this classifier.


Proof. First, since we can scale/center the data, for any linear
operator $v$ such that $\|v\| = 1$ we have

$$0 < \sup(\mathrm{supp}(v^T f_-) \cup \mathrm{supp}(v^T f_+)) - \inf(\mathrm{supp}(v^T f_-) \cup \mathrm{supp}(v^T f_+)) \leq 1,$$

and consequently we can narrow down to the error over the unit
interval ². From Lemma 1 we get

$$\int_0^1 \min\{(v^T f_-)(x), (v^T f_+)(x)\}\,dx \leq \sqrt{\int_0^1 \left(\min\{(v^T f_-)(x), (v^T f_+)(x)\}\right)^2 dx}. \quad (1)$$

For any $a, b \in \mathbb{R}_+$ we have $\min\{a, b\} \leq \sqrt{ab}$, thus

$$\min\{(v^T f_-)(x), (v^T f_+)(x)\} \leq \sqrt{(v^T f_-)(x)(v^T f_+)(x)},$$

which connected with (1) yields

$$\mathcal{R}_{EAA}(v; f_-, f_+) = \int_0^1 \min\{(v^T f_-)(x), (v^T f_+)(x)\}\,dx \leq \sqrt{\int_0^1 (v^T f_-)(x)(v^T f_+)(x)\,dx},$$

consequently, as $f_-, f_+$ are not multithreshold linearly separable,
$\mathcal{R}_{EAA}(v; f_-, f_+)$ is strictly positive, thus

$$-\ln(\mathcal{R}_{EAA}(v; f_-, f_+)) \geq -\ln\left(\sqrt{\int_0^1 (v^T f_-)(x)(v^T f_+)(x)\,dx}\right) = \frac{1}{2} H_2^\times(v^T f_-, v^T f_+).$$

□


In other words, by maximizing Renyi's quadratic cross entropy
(minimizing the cross information potential) we should also optimize
the negative log likelihood of correct classification (get close to the
Bayes risk of the 0/1 error). It is worth noting that we do not assume
any particular kernel, so even though MELC is defined with Gaussian
mixture kernel density estimation, the theorem holds for any square
integrable distributions on the $[0, 1]$ interval.
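The bound can be checked numerically for any pair of densities supported on the unit interval. The sketch below uses the Beta(2, 5) and Beta(5, 2) densities as an illustrative, non-Gaussian choice (an assumption made here, not an example from the paper) and compares the overlap error with the square root of the cross information potential.

```python
import math


def integrate(f, steps=20000):
    """Midpoint rule on the unit interval [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))


# Two illustrative projected densities supported on [0, 1].
def f_neg(x):
    return 30.0 * x * (1.0 - x) ** 4      # Beta(2, 5) density


def f_pos(x):
    return 30.0 * x ** 4 * (1.0 - x)      # Beta(5, 2) density


r_eaa = integrate(lambda x: min(f_neg(x), f_pos(x)))   # overlap error
ip = integrate(lambda x: f_neg(x) * f_pos(x))          # cross information potential
h2x = -math.log(ip)                                    # quadratic cross entropy

# Statement of the theorem: -ln(r_eaa) >= h2x / 2, i.e. r_eaa <= sqrt(ip).
bound_holds = -math.log(r_eaa) >= 0.5 * h2x
```

For this pair the overlap error is about 0.22 while the square root of the cross information potential is about 0.57, so the bound holds with a visible gap.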

² For KDE based on functions with infinite support, under a proper
scaling, the integral of the pdf outside the $[0, 1]$ interval goes to 0
as the sample size grows to infinity.
Figure 1: Visualization of sampled points for each dataset (first column),
hinge loss and Bayesian risk of linear models (second column), underlying
dataset distribution (third column) and finally the square root of the
cross information potential and the Bayesian risk of multithreshold models
(last column). The X axis corresponds to the angle of the $v$ vector. Large
dots correspond to the minima of each function; additionally, for both the
hinge loss and $\sqrt{ip^\times}$ there is another dot denoting the value of
the true error obtained if the solution is selected using these objectives.


4 Experiments

To further confirm our claims we perform simple numerical experiments
on five datasets, three of which are synthetic and two are real life
examples. During this evaluation we analyze all possible linear models
in two-dimensional space and compare how a particular upper bound
objective (hinge loss in the case of linear classifiers and
non-regularized MELC for multithreshold classifiers) behaves as compared
to the Bayesian risk. Figure 1 visualizes the results for: two radial
Gaussian distributions (one per class) in 2d space; four radial Gaussian
distributions placed alternately (two per class) in a line; four
random strongly overlapping Gaussian distributions (two per class);
the fourclass dataset [2]; and a 2d PCA embedding of the images of 0s
and 2s (positive class) and 3s and 8s (negative class) from the MNIST
dataset [3].

First, it is easy to notice the convexity of the hinge loss objective
function. Even for problems having multiple local optima (like the
fourth dataset) the SVM objective function has just one, global optimum,
which is the core advantage of such an approach. At the same time, the
non-regularized MELC function has a similar number of local optima to
the Bayesian risk function; however, it is much smoother, and as a
result one of the local solutions that is unimportant in terms of the
0/1 loss in the fourth example (located near 0.5) is not a solution of
MELC.

On the other hand, for datasets where the considered class of models
is not sufficient (like the third problem for the linear model) the
convex upper bound given by the hinge loss leads to the selection of a
point distant from the true optimum (see Table 1). MELC, in contrast,
seems to better approximate the underlying Bayesian risk function and
results in solutions with comparable error (even if the solution itself
is far away from the true optimum, as in the case of the fourth dataset).


5 Conclusions 

In this paper the Multithreshold Entropy Linear Classifier is analyzed
in terms of its consistency with the 0/1 loss function in the class of
multithreshold linear classifiers. It has been shown that it is truly
consistent with some simple distribution classes and that in general its
objective function upper bounds the 0/1 loss in a similar manner as the
hinge or square losses do. Experiments on synthetic, low dimensional
data showed that in practice one can expect that optimization of the
MELC objective function truly leads to the nearly optimal classifier
with the sample size growing to infinity.


dataset          $E(v_H, l_{0/1})$  $\cos(v_H, v_{0/1})$  $E(v_{ip^\times}, \mathcal{R}_{EAA})$  $\cos(v_{ip^\times}, v_{\mathcal{R}_{EAA}})$
2 Gauss 2d       6%                 1.00                  3%                                     1.00
4 Gauss in line  0%                 0.96                  0%                                     1.00
4 Gauss mixed    34%                0.56                  5%                                     1.00
fourclass        1%                 1.00                  7%                                     0.05
MNIST            2%                 0.99                  1%                                     1.00

Table 1: Comparison of the solutions given by hinge loss optimization and
the optimal linear classifier, and between non-regularized MELC and the
optimal multithreshold linear classifier. The error function $E(v, l)$ is
the relative increase in the corresponding error measure $l$ when using a
particular optimization scheme. $v_H$ is the linear projection given by
hinge loss optimization, $v_{0/1}$ by 0/1 loss optimization,
$v_{ip^\times}$ by non-regularized MELC and $v_{\mathcal{R}_{EAA}}$ the
optimal multithreshold linear projection in the Bayesian sense.